WAIS (Wide Area Information Servers) is a tool that lets users search for
keywords in a database of documents available on a system and then shows the results. WAIS
was developed by Thinking Machines and was spun off to a separate company, WAIS Inc.,
when it became immensely popular. A free version of WAIS was made available to the
Clearinghouse for Networked Information Discovery and Retrieval (CNIDR) as freeWAIS,
which is the version most often found on Linux systems.
WAIS lets a user enter some keywords or phrases, and then searches a database for those
terms. A typical WAIS search screen is shown in Figure 44.1. (This screen is a
Windows-based Web browser accessing the primary WAIS server at http://www.wais.com. This
server is a good place to look for examples of how you can use WAIS.) This example shows a
search for the keyword "hubble" (WAIS usually ignores case). You can enter long
and complex or relatively simple search criteria on a WAIS search line. After searching
all the database indexes it knows about, WAIS shows its results, as shown in Figure 44.2.
This screen shows the search results based on the keywords "hubble" and
"magnitude."
Figure 44.1.
A search screen from a WAIS site.
The display generated by WAIS, often displayed in a WWW browser or a WAIS browser as in
these figures, lists each match with a score from 0 to 1000 indicating how well
the keywords match the index (higher numbers are better matches). Users can then
refine the list, expand it, or examine documents listed. In Figure 44.3, one of the
documents listed in the search results is displayed in the WWW browser window. WAIS can
handle many file formats, including text and documents, audio, JPEG and GIF files, and
binaries.
The Linux version of WAIS is called freeWAIS. This chapter looks at how you can set up
a freeWAIS server on your Linux machine. WAIS is a useful tool to provide if you deal with
a considerable amount of information you want to make generally available. This
information could be product information, details about a hobby, or practically any other
type of data. All that matters is that you want to make it available to others, either on
your local area network or on the Internet as a whole.
The freeWAIS package has three parts to it: an indexer, a WAIS service, and a client.
The indexer handles database information and generates an index that contains keywords and
a table indicating the word's occurrences. The service component does the matching between
a user's requests and the indexed files. The client is the user's vehicle to access WAIS
and is usually a WAIS or WWW browser. WWW browsers usually have an advantage over WAIS
browsers in that the latter cannot display HTML documents.
A backward-compatible successor to freeWAIS, called ZDIST, is currently available in a
beta version. ZDIST behaves much like freeWAIS, with any changes noted in the
documentation. ZDIST adds some new features and is a little smaller and faster than
freeWAIS. Because beta software tends to be unstable, this chapter concentrates on
freeWAIS.
The freeWAIS software is often included in a complete Linux distribution CD-ROM and is
readily available from many FTP and BBS sites. Alternatively, you can get it by anonymous
FTP from the CNIDR site at ftp.cnidr.org. The freeWAIS system resides in the file
/pub/NIDR.tools/freewais/freeWAIS-X.X.tar.Z where X.X is the latest version number. The
CNIDR site has many binaries available for different machines, as well as generic source
code that can be tailored to many different systems.
One of the files in the distribution software, which should be placed in the
destination directory, is the Makefile used to create the program. If you are compiling
the freeWAIS source yourself, examine the Makefile to ensure that the variables are set
correctly. Most are fine by default, pointing to standard Linux utilities. The exceptions
that you may have to tweak yourself are as follows:
| CC | This variable is the name of the C compiler you use (usually cc or gcc). |
| CURSELIB | This variable should be set to the current version of the curses library on your system. |
| RESOLVER | Leave this variable blank unless you are using a name resolver, in which case set it to -lresolv. |
| TOP | This variable specifies the full path to the freeWAIS source directory. |
The CFLAGS options enable you to specify compiler flags when the freeWAIS source is
compiled. Many options are supported, all of which are explained in the documentation
files that accompany the source. Most of the flag settings can be left as their default
values in Linux systems. A few of the specific flags you may want to alter are worth
mentioning, though. The most useful are the indexer flags, of which two are potentially
useful:
| -DBIO | This flag is used to allow indexing on biological symbols and terms. Use it only if your site will deal with biological documents. |
| -DBOOLEANS | This flag enables you to use Booleans such as AND and NOT. This flag can be handy for extending the power of searches. |
The -DBOOLEANS flag handles logical searches. For example, if you are looking for the
keywords green leaf, WAIS by default searches for the words green and leaf separately and
judges matches on the two words independently. With the -DBOOLEANS flag set, you can join
the two words with AND so that a document matches only if it contains both green and leaf.
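The difference between the two behaviors can be illustrated outside WAIS with ordinary shell tools. The following is only an analogy built on grep, not a description of how freeWAIS implements Boolean searches, and the function names matches_any and matches_all are made up for this sketch.

```shell
# Illustration only: mimic the two matching behaviors with grep.
# matches_any and matches_all are hypothetical helpers, not freeWAIS commands.

# Succeeds if FILE contains at least one of the words (default-style matching).
matches_any() {
    file=$1; shift
    for word in "$@"; do
        if grep -qiw -- "$word" "$file"; then
            return 0
        fi
    done
    return 1
}

# Succeeds only if FILE contains every one of the words (AND-style matching).
matches_all() {
    file=$1; shift
    for word in "$@"; do
        if ! grep -qiw -- "$word" "$file"; then
            return 1
        fi
    done
    return 0
}
```

A document that mentions only green satisfies matches_any for the words green and leaf, but not matches_all.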
A couple of other flags that may be useful for freeWAIS sites deal with the behavior of
the system as a whole:
| -DBIGINDEX | Set this flag when there are many (thousands of) documents to index. |
| -DLITERAL | This flag allows a literal search for a string, as opposed to using partial hits on the string's component words. |
| -DPARTIALWORD | This flag allows searches with asterisks as wildcards (such as "auto*"). |
| -DRELEVANCE_FEEDBACK | When this flag is set to ON, it allows clients to use previous search results as search criteria for a new search. |
A number of directories are included in the distribution software, most of which are of
obvious intent (bin for binaries, man for man pages, and so on). The directories used by
freeWAIS in its default configuration are:
| bin | Binaries |
| config.c | C source code for configuration |
| doc | Doc files, help files, and FAQs |
| include | Header files used by the compiler |
| lib | Library files |
| man | Man pages |
| src | freeWAIS source code |
| wais-sources | Directory of Internet servers |
| wais-test | Sample indexer and service scripts |
Once you have fine-tuned the configuration file information, you can compile the
freeWAIS source with the make command:
make default
By default, the make utility compiles two clients called swais and waisq. If you want
to compile an X version of WAIS called xwais (useful if you want to allow access from X
terminals or consoles), uncomment the line in the Makefile that ends with makex.
When you have the compiled freeWAIS components installed and configured properly, you
can begin setting up the WAIS index files for the documents available on your system. Start by
creating an index directory whose default name is wsindex. The directory usually resides
just under the root of the filesystem (/wsindex), but many administrators like to keep it
in a reserved area for the WAIS software (such as /usr/wais/wsindex). If the index files
are difficult to locate, problems can result when users try to find them.
The wais-test directory created when you installed freeWAIS contains a script called
test.waisindex that creates four WAIS index files for you automatically. You use these
files to test the WAIS installation for proper functionality. They can also show you how
to use the different search and index capabilities of freeWAIS. The following are the four
index files:
Only graphically based browsers (usually X-based) can handle the multidocument formats,
although any type of browser should be able to handle the other three index formats.
Once you have verified that the indexing system works properly and all the components
of freeWAIS are properly installed, you need to build an index file for the documents
available on your system. You can do this with the waisindex command. The waisindex
command enables you to index files in two ways by using the -t option followed by one of
these keywords: one_line or text.
The waisindex command takes arguments for the name of the destination index file (-d
followed by the filename), and the directory or files to be indexed. For example, to index
a directory called /usr/sales/sales_lit into a destination index file called sales using
the one_line indexing approach, you would issue the following command:
waisindex -d sales -t one_line /usr/sales/sales_lit
Because no path is provided for the sales index file in this example, it would be
stored in the current directory.
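To keep index files from ending up in whatever directory you happen to be in, you can always pass an explicit destination path. The following wrapper is a minimal sketch: INDEX_DIR, WAISINDEX, and build_index are names invented here, and the only waisindex options used are the -d and -t options shown above.

```shell
# Sketch: always give waisindex an explicit destination under one index directory.
# INDEX_DIR and build_index are hypothetical names. WAISINDEX can be overridden
# (for example with "echo") to preview the command without running it.
INDEX_DIR=${INDEX_DIR:-/usr/wais/wsindex}
WAISINDEX=${WAISINDEX:-waisindex}

# build_index NAME TYPE PATH...
# runs: waisindex -d INDEX_DIR/NAME -t TYPE PATH...
build_index() {
    name=$1
    kind=$2
    shift 2
    "$WAISINDEX" -d "$INDEX_DIR/$name" -t "$kind" "$@"
}

# Example (commented out so that loading this file has no side effects):
# build_index sales one_line /usr/sales/sales_lit
```

Setting WAISINDEX=echo before calling build_index prints the arguments that would be passed, which is a cheap way to check the paths first.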
Once you have started the WAIS server software (see "Starting freeWAIS"
below), you can test newly created indexes. To test the indexes, use the waissearch
command. For example, to look for the word "WAIS" in the index files, issue the
following command:
waissearch -p 210 -d index_file WAIS
In this example, -p gives the port number (default value is 210), and -d is the path to
the index file. If the search was successful (and you have something that matches), you
will see messages about the number of records returned, and the scores of each match. If
you see error messages or nothing, check the configuration information and the index
files.
A final step you can take if you want Internet users to be able to access your freeWAIS
system is to issue the following command:
waisindex -export -register Filenames
In this example, Filenames is the name of the index. This name is registered with the
Directory of Servers at cnidr.org and quake.think.com. These addresses are reached
automatically with the -register option. Do this step only if you want all Internet users
to be accessing your WAIS service. (See the section "The waisindex Command" for
more information on the waisindex command.)
If you want to allow clients to connect to your freeWAIS system with a WWW browser
(such as Mosaic or Netscape) and access HTML sources on your system through WAIS, you must
issue the following command:
waisindex -d WWW -T HTML -contents -export /usr/resources/*html
This line enables WAIS clients to perform keyword searches on HTML documents as well.
If you want, you can set WAIS to allow only certain domains to connect to it. You can
do this in the ir.h file, which has a line like the following:
#define SERVSECURITYFILE "SERV_SEC"
This line is commented out by default. Remove the comment symbol. You have to place a
copy of an existing SERV_SEC file or one you create yourself in the same directory as the
WAIS index files. If there is no SERV_SEC file accessible to WAIS, all domains are allowed
access. (You can change the name of the file, of course, as long as the entry in ir.h
matches the filename with quotation marks around it.)
Each ASCII entry in the SERV_SEC file follows a strict format for defining the domains
that are granted access to WAIS. The format of each line is as follows:
domain [IP address]
Each line has the domain name of the host to which you want to grant access, with its
IP address an optional add-on to the line. If the domain name and IP address do not match,
it doesn't matter because WAIS allows access to a match of either name or address. A
sample SERV_SEC file looks like this:
chatton.com
roy.sailing.org
bighost.bignet.com
Each of these three domain names can access WAIS, but any connection from a host
without these domain names is refused.
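The access rule can be sketched as a small shell function for checking a SERV_SEC file by hand. This is an approximation under stated assumptions: serv_sec_allows is a hypothetical helper, the real check happens inside the WAIS server, and the substring matching used here is a guess at its behavior.

```shell
# Sketch of the rule described above: a client is admitted when either its
# host name or its IP address matches an entry. serv_sec_allows is a
# hypothetical helper, and substring matching is an assumption.
# Usage: serv_sec_allows FILE HOSTNAME IPADDR
serv_sec_allows() {
    # No readable security file means every domain is allowed.
    [ -r "$1" ] || return 0
    awk -v host="$2" -v ip="$3" '
        NF == 0 { next }                          # skip blank lines
        index(host, $1) > 0 { ok = 1 }            # domain name matches
        NF >= 2 && index(ip, $2) > 0 { ok = 1 }   # optional IP address matches
        END { exit ok ? 0 : 1 }
    ' "$1"
}
```

The function returns success when access would be granted, so it can be used directly in an if statement.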
The SERV_SEC file should be owned by the login name and group that the freeWAIS system
runs under (freeWAIS should not run as root, to avoid security problems), and the file
should be modifiable only by root. In other words, if you are letting freeWAIS run under
the login waismgr, all the files should be owned by the user waismgr and that login's
group (which ideally would be unique for extra security). The files should not have write
access for user, group, or other (making root the only login that can write these
files).
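The permission part of this advice can be applied with chmod. The helper below is a minimal sketch. The name lock_security_files is invented here, and the ownership change is shown only as a comment because it requires root.

```shell
# Sketch: remove write access from the security files for everyone.
# lock_security_files is a hypothetical helper. Changing the owner needs root,
# so that step is shown only as a comment (waismgr is the example login above):
#   chown waismgr SERV_SEC
lock_security_files() {
    for path in "$@"; do
        chmod a-w "$path" || return 1   # no write access for user, group, or other
    done
}

# Example: lock_security_files /usr/wais/wsindex/SERV_SEC
```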
Similar to the SERVSECURITYFILE variable is DATASECURITYFILE, which controls access to
the databases. Again, there is a line in the ir.h file that you should uncomment to look
like the following:
#define DATASECURITYFILE "DATA_SEC"
DATA_SEC is a file listing each database file and the domains that have access to it.
The file should reside in the same directory as the index files. The format of the
DATA_SEC file is as follows:
database domain [IP address]
In this example, database is the name of the database the permissions refer to, and
domain and optional IP address are the same as the SERV_SEC file. A sample DATA_SEC file
looks like the following:
primary chatton.com
primary bignet.org
primary roy.sailing.org
sailing roy.sailing.org
In this example, three domains are granted access to a database called primary (note
that primary is just a filename and has no special meaning), and one domain has access to
the database called sailing. If you want to allow all hosts with access to the system
(controlled by SERV_SEC) to access a particular database, you can use asterisks in the
domain name and IP address fields. For example, the following entries allow anyone with
access to WAIS to use the primary database, with one domain only allowed access to the
sailing database:
primary *
sailing roy.sailing.org
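A lookup against a DATA_SEC file in this format might be sketched as follows. The helper name data_sec_allows is hypothetical, and the matching rules (exact database name, asterisk as a wildcard, substring match on the domain) are assumptions drawn from the description above rather than from the freeWAIS source.

```shell
# Sketch of a DATA_SEC lookup. data_sec_allows is a hypothetical helper and
# the matching rules are assumptions based on the description above.
# Usage: data_sec_allows FILE DATABASE HOSTNAME
data_sec_allows() {
    awk -v db="$2" -v host="$3" '
        $1 != db { next }                           # entry is for another database
        $2 == "*" { ok = 1 }                        # asterisk: any admitted host
        NF >= 2 && index(host, $2) > 0 { ok = 1 }   # domain name matches
        END { exit ok ? 0 : 1 }
    ' "$1"
}
```

With the two-line sample above, any host passes the check for primary, while only roy.sailing.org passes it for sailing.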
In both the SERV_SEC and DATA_SEC files, you have to be careful with the IP addresses
to avoid inadvertently granting access to hosts not wanted on your system. For example, if
you specify the IP address 150.12 in your file, then any IP address that begins with
150.12, such as 150.120 or 150.121, is also granted access because it matches the IP
components. Specify IP addresses explicitly to avoid this problem.
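The pitfall is easy to reproduce with a plain substring test. In this illustration, ip_entry_matches is a hypothetical helper that treats the file entry as a fragment of the client address, which is the behavior the warning describes.

```shell
# Illustration of the pitfall: a short address fragment matches more than intended.
# ip_entry_matches is a hypothetical helper.
# Usage: ip_entry_matches ENTRY CLIENT_IP
ip_entry_matches() {
    case $2 in
        *"$1"*) return 0 ;;
        *)      return 1 ;;
    esac
}

# ip_entry_matches 150.12     150.12.7.9    succeeds (intended)
# ip_entry_matches 150.12     150.120.3.4   also succeeds (probably not intended)
# ip_entry_matches 150.12.7.9 150.120.3.4   fails (an explicit address is safe)
```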
As with the FTP services, you can set freeWAIS to start when the system boots by
using the rc files, start it from the command line at any time, or have inetd start the
processes when a service request arrives. If you want to start freeWAIS from the command
line, you need to specify a number of options. A sample startup command line looks like
this:
waisserver -u username -p 210 -l 10 -d /usr/wais/wais_index
The -u option tells waisserver to run as the user username (which has to be a
valid user in /etc/passwd, of course). The -p option tells waisserver what port to use
(the default is 210, as shown in the /etc/services file). The -l option sets the logging
level (10 in this example). The -d option specifies the default location of WAIS indexes.
If you want to log sessions to a file, use the -e option followed by the name of the
logfile.
You should run waisserver as another user instead of root to prevent holes in the WAIS
system being exploited by a hacker. If the service is run as a standard user (such as
wais), only the files that the user would have access to are in jeopardy.
If the port for waisserver is set to 210, the service follows the Internet
standard for WAIS access. If you set the value to another port, you can configure the
system for local area access only. If the port number is less than 1024, root must start
and manage the WAIS service, but any port numbered 1024 or higher can be handled by a
normal user. If you intend to use port 210, you don't have to specify the number in the
command line, although you must still use the -p option.
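A startup script can make the port rule explicit before it tries to launch the server. This is a minimal sketch: port_needs_root is a name invented here, and it relies on the Linux convention that ports below 1024 are privileged.

```shell
# Sketch: decide whether a chosen WAIS port needs root to bind.
# port_needs_root is a hypothetical helper. On Linux, ports below 1024 are privileged.
port_needs_root() {
    [ "$1" -lt 1024 ]
}

# Example:
# if port_needs_root 210; then echo "port 210 must be opened by root"; fi
```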
If you want to let inetd handle the waisserver startup, you need to ensure that the
file /etc/services has an entry for WAIS. The line in the /etc/services file will look
like this:
z3950 210/tcp #WAIS
In this example, 210 is the port number WAIS uses, and tcp is the protocol. After
modifying or verifying the entry in /etc/services, you need to add a WAIS entry to the
inetd.conf file to start up waisserver whenever a request is received on port 210 (or
whatever other port you are using). The entry looks like this:
z3950 stream tcp nowait root /usr/local/bin/waisserver waisserver.d -u username -d /usr/wais/wais_index
The options are the same as for the command line startup mentioned earlier. The name
waisserver.d, which follows the path to the program, is used when starting up in that
mode, instead of waisserver. Again, you can use the -e option to log activity to a file.
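Before relying on inetd, it can be worth confirming that the services file really contains the entry. The check below is a sketch. The helper name has_wais_service is hypothetical, and it only looks at the port/protocol field, not at the service name.

```shell
# Sketch: confirm that a services file has a TCP entry for the WAIS port.
# has_wais_service is a hypothetical helper.
# Usage: has_wais_service [FILE] [PORT]   (defaults: /etc/services, 210)
has_wais_service() {
    awk -v want="${2:-210}/tcp" '
        $1 ~ /^#/ { next }            # skip comment lines
        $2 == want { found = 1 }      # second field is port/protocol
        END { exit found ? 0 : 1 }
    ' "${1:-/etc/services}"
}

# Example:
# has_wais_service || echo "add a z3950 210/tcp line to /etc/services"
```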
Once you have the freeWAIS server ready to run and everything seems to be working, it's
time to provide some content for your WAIS system. Usually, documents are the primary
source of information for WAIS, although you can index any type of file. The key step to
providing WAIS service is to build the WAIS index using the waisindex command. The
waisindex command can be a bit obtuse at times, but a little practice and some trial and
error will help you master its somewhat awkward behavior.
The waisindex program works by examining all the data in the files for which you want
to create an index. From its examination, waisindex usually generates seven different
index files (depending on the content and your commands). Each file holds a list of unique
words in the documents. The different index files are then combined into one large
database, often called the source (or WAIS source). Whenever a client WAIS package submits
a search, the search strings are compared to the source and the results displayed with
accuracy analysis.
The use of waisindex allows a client search to proceed much faster because the keywords in the data files have already been extracted. However, the mass of data in the index files can be sizable, so allow lots of disk space for a WAIS server to work with.
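As a rough analogy for what an indexer does ahead of time, the following pipeline splits a file into lowercase words and counts how often each occurs. It is far simpler than the index files waisindex actually writes, and word_counts is a name invented for this sketch.

```shell
# Rough analogy for keyword extraction: split text into lowercase words and
# count each one, most frequent first. word_counts is a hypothetical helper.
# Usage: word_counts FILE
word_counts() {
    tr -cs 'A-Za-z' '\n' < "$1" |
        tr 'A-Z' 'a-z' |
        sed '/^$/d' |
        sort | uniq -c | sort -k1,1nr -k2
}
```

Because the counting is done once, in advance, a later search only has to consult the word list instead of rereading every document.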
A system user usually cannot read the freeWAIS index files (although one or two files
can be read with some success). Usually, waisindex creates seven index files, although the
number may vary depending on requirements. The index files all have a specific file
extension to show their purpose, based on a root name (specified on the waisindex command
line, or defaulting to "index"). The index files and their purposes are
described in the following list:
The source description file is a standard ASCII file that is read by waisindex at
intervals to see if information has changed. If the changes are significant, waisindex
updates its internal information. A typical source file looks like this:
(:source
   :version 2
   :ip-address "147.120.0.10"
   :ip-name "wizard.tpci.com"
   :tcp-port 210
   :database-name "Linux stuff"
   :cost 0.00
   :cost-unit :free
   :maintainer "wais_help@tpci.com"
   :subjects "Everything you need to know about Linux"
   :description "If you need to know something about Linux, it's here."
)
You'll want to edit this file when you set up freeWAIS because the default descriptions
are rather spare and useless.
The waisindex command provides a number of options, some of which you have seen earlier
in this chapter. The primary waisindex options of interest to most users are the
following:
You must tell the waisindex program what type of information is in a file, or it may
not be able to generate an index properly. Many filetypes are currently defined with
freeWAIS, which you can display by entering the command with no argument:
waisindex
Although many different types are supported by freeWAIS, only a few are really in
common use. The most common file types supported by freeWAIS are the following:
To tell waisindex the type of file to be examined, use the -t option followed by the
proper type. For example, to index standard ASCII text, you could use the command:
waisindex -t text -r /usr/waisdata/*
This command indexes all the files in /usr/waisdata recursively, assuming they are all
ASCII files.
When a document has been indexed, any changes in the document will not be reflected in the WAIS index unless a complete reindex is performed. Using the -a option does not update existing index entries. Instead, start the index process again. You should do this at periodic intervals as a matter of course.
You can provide some extra features for users of your freeWAIS service in a number of
ways. Although this section is not exhaustive by any means, it will show you two of the
easily implementable features that make a WAIS site more attractive.
To begin, suppose you want to make video, graphics, or audio available on a particular
subject. As an example, imagine that your site deals with musical instruments and you have
lots of documents on violins. You may want to provide an audio clip of a violin being
played, a video of the making of a violin body, or a graphic image of a Stradivarius
violin. To make these extra files available, you should have all the files with the same
filename but different extensions. For example, if your primary document on violins is
called violins.TEXT, you may have the following files in the WAIS directories:
| violins.TEXT | Document describing violins |
| violins.TIFF | Image of a Stradivarius |
| violins.MPEG | Video of the making of a violin body |
| violins.MIDI | MIDI file of a violin being played |
All these files should have the same root name (violins) but different types
(recognized by waisindex). Then you have to associate the multimedia files with the
document file. You can do this with the following command:
waisindex -d violin -M TEXT,TIFF,MPEG,MIDI -export /usr/waisdata/violin/*
This tells waisindex that all four types of files are to be handled. When a user
searches for the keyword violin, all four types of files will be matched, and options on
the browser may let them play, view, or hear the non-text components.
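Because the association depends entirely on matching root names, a quick check for missing variants before indexing can save a reindex later. In this sketch, missing_variants is a hypothetical helper that prints the name of each expected file that does not exist.

```shell
# Sketch: before indexing, check that every expected variant of a document exists.
# missing_variants is a hypothetical helper. It prints each missing file name.
# Usage: missing_variants DIR ROOT EXT...
missing_variants() {
    dir=$1
    root=$2
    shift 2
    for ext in "$@"; do
        [ -e "$dir/$root.$ext" ] || printf '%s\n' "$root.$ext"
    done
}

# Example: missing_variants /usr/waisdata/violin violins TEXT TIFF MPEG MIDI
```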
Another common feature is the use of synonyms to account for different methods of
specifying a subject. For example, a scientist may use the keyword feline, but a
non-scientist may use cat. You want to be able to match these two words to the same thing.
You can do this through a file called SOURCE.syn, which is automatically read by the
search engine when it is working. The SOURCE.syn file has the following format:
word synonym [synonym ...]
Here, word is the word to be used to search the databases, and synonym is the word(s)
that should match it. For example, if you are dealing with domestic pets in your WAIS
site, you may have the following entries in the SOURCE.syn file:
cat feline
dog canine hound pooch
bird parrot budgie
The synonym file can be very useful when people use different terms to refer to the
same thing. An easy way to check for the need for synonyms is to set the waisserver
logging level (the -l option) to 10 for a while and see what words people are using on
your site. Don't leave it on too long, as the logfiles can become enormous with even a
little traffic.
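The effect of a synonym table in this format can be sketched with awk: whatever term the user typed is mapped to the word at the start of its line. The helper canonical_term is hypothetical. The real lookup is done by the freeWAIS search engine, and its exact rules may differ.

```shell
# Sketch of applying a synonym table: map a search term to the word on the
# left of its line. canonical_term is a hypothetical helper.
# Usage: canonical_term FILE TERM
canonical_term() {
    awk -v term="$2" '
        {
            for (i = 1; i <= NF; i++) {
                if ($i == term) { print $1; found = 1; exit }
            }
        }
        END { if (!found) print term }   # unknown terms pass through unchanged
    ' "$1"
}

# Example: canonical_term SOURCE.syn hound
```

Terms that appear nowhere in the file are returned unchanged, so the function is safe to apply to every query word.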
Now that WAIS is up and running on your server, you can go about the process of building your index files and letting others access your server. WAIS is quite easy to manage, and offers a good way of letting other users access your system's documents. The alternative approach, for text-based systems, is Gopher, which you examine in the next chapter.